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Received approaches to a unified phenomenon called "language" are firmly committed 
to a Cartesian view of distinct unobservable minds. Questioning this commitment leads 
us to recognize that the boundaries conventionally separating the linguistic from the 
non-linguistic can appear arbitrary, omitting much that is regularly present during vocal 
communication. The thesis is put forward that uttering, or voicing, is a much older 
phenomenon than the formal structures studied by the linguist, and that the voice has 
found elaborations and codifications in other domains too, such as in systems of ritual 
and rite. Voice, it is suggested, necessarily gives rise to a temporally bound subjectivity, 
whether it is in inner speech (Descartes' "cogito"), in conversation, or in the synchronized 
utterances of collective speech found in prayer, protest, and sports arenas world wide. The 
notion of a fleeting subjective pole tied to dynamically entwined participants who exert 
reciprocal influence upon each other in real time provides an insightful way to understand 
notions of common ground, or socially shared cognition. It suggests that the remarkable 
capacity to construct a shared world that is so characteristic of Homo sapiens may be 
grounded in this ability to become dynamically entangled as seen, e.g., in the centrality 
of joint attention in human interaction. Empirical evidence of dynamic entanglement in 
joint speaking is found in behavioral and neuroimaging studies. A convergent theoretical 
vocabulary is now available in the concept of participatory sense-making, leading to the 
development of a rich scientific agenda liberated from a stifling metaphysics that obscures, 
rather than illuminates, the means by which we come to inhabit a shared world. 
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1. INTRODUCTION 

We speak with confidence of something called "language," as if 
this term referred to a single system, capable of multiple forms of 
manifestation (writing, speech, signing), but unified by organized 
structures and processes in the formal domains of phonology, 
morphology, syntax, and semantics. This emphasis on systematic- 
ity and symbolic encoding has utterly dominated the scientific 
view of "language" at least since the structuralist innovations 
of Saussure (1959/1916), and has been greatly reinforced by the 
pivotal role of generative linguistics in the birth of the cogni- 
tivist account of mind as a form of symbol-based information 
processing (Fodor, 1975). In the context of inter-personal com- 
munication, language, on this view, serves as a form of message 
passing, whereby ideas conceived in the mind of one person are 
encoded, first into words, and then into movements of mouth or 
hand, at which point they become transmittable to another, who 
sets about decoding them, thereby gaining access to the ideas of 
the sender. The message passing perspective on language is com- 
pelling, powerful, and supported by a host of technologies, from 
the very first forms of writing to the most sophisticated of digital 
platforms. 

The emphasis on symbols and systematicity allows the iden- 
tification of a tentative boundary between the linguistic and the 
non-linguistic. For example, a conventional distinction is drawn 



between phonological and non-phonological characteristics of 
the sounds of speech. Roughly, those features that support the 
identification of discrete categories such as phonemes, are taken 
as indices of linguistic structure, while non-categorical and con- 
tinuously varying features such as the loudness of a voice would 
lie beyond the notional bounds of language proper. Once discrete 
entities belonging to non-overlapping categories are available, 
they can be combined into larger symbolic structures, from 
syllables to novels. 

Language thus appears to be a clearly delineated and unified 
phenomenon, of which one can meaningfully construct theories. 
This leads to a compelling observation that there seems to be a 
yawning chasm between the many kinds of communication sys- 
tems found in animals and the generative, creative richness found 
in every human language. And so the foundations are laid for the 
perplexing observation that language seems to have appeared not 
so very long ago in an evolutionary timescale, and to have imme- 
diately enabled the development of the whole of human culture, 
technology, and all the institutions of all societies. 

Two related observations will serve to provide us here with a 
slightly different view of "language." The first is that the above 
story is fundamentally committed to an ontological split between 
mind and world. If we accept such a split, then meanings or ideas 
belong firmly in the realm of the mental, and they find expression 
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indifferently in writing or speech, each of which provides a kind 
of physical container for the passing of ideas from one mind 
to the next. The second observation is that the traditional story 
enforces a somewhat arbitrary divide between the linguistic and 
non-linguistic, motivated by the desire to ensure that language is 
systematic and supports the kind of symbolic operations familiar 
from syntax and related disciplines. If we observe communica- 
tion among people, we see many aspects to that behavior that 
never feature in linguistic theory, and that nevertheless seem to be 
reliably and essentially associated with inter-personal communi- 
cation. These two observations are related, because if we consider 
alternatives to the Cartesian mind/world split that divides ideas 
and meanings from sounds and movements, the apparent signifi- 
cance of many of the behaviors and features reliably and regularly 
attending communication may change, and with that, the bound- 
aries of "language" may shift, or, indeed, fragment, to reveal a 
variety of phenomena that do not admit of a single systematic 
description. 

I will argue that the way in which we conventionally treat 
of the phenomenon called "language" is overly restrictive, and 
seems more appropriate to the characterization of writing than 
speaking/listening (Linell, 2005). Older than writing by far is the 
voice, and the voice has remarkable properties all of its own. Chief 
among these is the obligatory association between the voice and a 
transient subject-pole that grounds intentionality. This, it seems 
to me, may be part of the reason the inner voice seems to be 
inextricably associated with the Cartesian subject. To develop this 
notion, I will turn to the substantive domain of joint, or collective, 
speaking, showing how collective speech engenders a different 
kind of subject, displaying collective intentionality. Furthermore, 
just as the voice of the individual admitted of development and 
codification in writing, so collective speaking admitted of devel- 
opment and codification in practices of liturgy and ritual. Written 
language, which is the more accurate target of modern linguis- 
tics, is thus not the only descendent of voice. The empirical study 
of collective speaking is in its infancy, but it reveals emergent 
phenomena that arise only in the real time reciprocal interac- 
tion of speakers speaking in unison. These emergent phenomena 
add substance to the argument that the traditional depiction of 
language as message passing mischaracterizes, or omits, much 
of what is going on in vocal communication (Cowley and Love, 
2006). It neglects the fluid intertwining of subjectivities that 
arises in real time reciprocal interaction, and that appears clearly 
in joint speaking. This only becomes apparent if we approach 
languaging (rather than language) as a set of multi-faceted 
behaviors that defy characterization from a single metaphysical 
viewpoint 1 . 

2. REVISITING DESCARTES 

Let us fancifully drop in on Descartes as he deduces his own exis- 
tence. The statement "Cogito, ergo sum" is without doubt the 
most famous line in Western Philosophy, and the basic outline of 



*A complementary account of languaging from an enactive perspective is 
provided in Bottineau (2010). This account adheres to a more conventional 
view of what the domain of language is than adopted here, but many of the 
fundamental concerns raised therein resonate with the themes of this article. 



the argument underlying it is overly familiar 2 . A skeptical philoso- 
pher, wishing to establish a foundation for true and certain 
knowledge, recognizes that the world of appearances, mediated 
by the senses may be illusory. He considers what remains after 
denying the testimony of the senses, and reasons thus: 

So after considering everything very thoroughly, I must finally 
conclude that this proposition, I am, I exist, is necessarily true 
whenever it is put forward by me or conceived in my mind 
(Meditation 2, AT 7:25). 

The "I" that is invoked here is explicitly and emphatically not a 
body, but a mind (7:27). The split between mind and world is 
absolute. Irrespective of how the consequences are played out, 
Descartes' certainty has become the split we have failed to dis- 
tance our selves from. Substance dualism narrowly conceived is, 
of course, not a respectable metaphysical position any more, but 
the split that is effected here between mind and world, and at 
the same time, between metaphysics and epistemology, far from 
being overcome, has become the foundational assumption upon 
which the whole of psychology (and more) has been built. As 
Sheets-Johnstone put it, it has become "a lexical band-aid cov- 
ering a 350-year-old wound generated and kept suppurating by a 
schizoid metaphysics" (Sheets-Johnstone, 1999, p. 275). 

But what is going on for Descartes? There is a voice. Whether 
it is a voice speaking in Latin "Cogito, ergo sum!" or a voice 
speaking in French "Je suis, j'existe!," it is a (silent) utterance — a 
thought in the form of words. Without language (better: lan- 
guaging), there is no such thought. Without a culturally specific 
history of vocal interaction among people during which meanings 
and uses of language emerge, there is no such voice. The solipsis- 
tic prison of Descartes' fancy is not so devoid of other people as 
he seems to believe, for in harboring the voice that can utter the 
"Cogito!," it is populated by the practice of Latin, or the practice 
of French. Closing the eyes does not keep out the world, and it 
does not keep out other people. 

The inner voice of linguistic thought that speaks here "to" 
Descartes is not different in kind from the outer voice of overt 
speech. Indeed, the whole metaphorical quagmire associated with 
the use of the terms inner and outer stems from the very confu- 
sion I wish to here circumvent. Vygotsky has presented a thorough 
argument that the overt but self-directed speech of young chil- 
dren is, firstly, a specialization of intersubjective social speech, and 
secondly, is the precursor to inner speech, or linguistic thought 
(Vygotsky, 1986). This insight provides us with an understanding 
of continuity between overt speech and silent speech, or linguistic 
thought. 

What if we choose to interpret Descartes' predicament some- 
what differently? Instead of considering the voice as evidence of 
a pre-existing subject, we might consider it to give rise to a tran- 
sient subjecthood. We cannot understand the occurrent thought 
as an utterance in the message-passing sense, as there are not two 
distinct domains, a speaker and a listener, for any message to be 
passed among. But we are now entertaining the tentative notion 



2 The famous Latin phrase does not appear in the Second Meditation, where 
the original argument is most clearly made. 
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that there is no Cartesian subject before the occurrence of the 
thought, and so any subjecthood associated with this utterance 
arises with the utterance and fades thereafter. This is not a fully 
fledged psychological subject, equipped with the mechanisms of 
"cognitive systems"; it is a subject-pole that allows a distinction 
between subject and world, or self and other, to be discerned, 
and that supports or invites the ascription of intentionality. It is 
a transient orientation, tied to the real time unfolding of the lin- 
guistic thought itself (". . . whenever it is put forward by me or 
conceived in my mind"). Later in the 2nd Meditation, Descartes 
himself seems to concur with this association of the Subject with 
the transient inner voice when he says "I am, I exist — that is 
certain. But for how long? For as long as I am thinking. For 
it could be that were I totally to cease from thinking, I should 
totally cease to exist" (Meditation 2, AT 7:27). Now the nature 
of "thinking" has not been generaly agreed upon, but the form 
of thinking Descartes here alludes to is clearly the utterance of an 
inner voice, in specific words, words which he is capable of repeat- 
ing to us, words which we can characterize as Latin or French. I 
wish to pursue this idea, that voice gives rise to the complemen- 
tarity between the poles of subject and world, and it does so in 
real time. 

3. VOICES AND SUBJECTS 

[V]oice is a kind of sound of an ensouled thing. For none of the 
things without soul gives voice, though some are said by analogy 
to give voice, such as the flute and the lyre and whatever other of 
the things without soul have the production of sustained, varied 
and articulate sound. For voice also has these features and so there 
is a likeness (Aristotle, 1986, 420b, p. 178). 

The association between the animate (even ensouled) subject and 
the voice is ancient. In Connor (2000) the long history of the sub- 
jects perceived as being behind voices emanating from unlikely 
places is recounted in detail. From the Delphic oracle through 
the medieval fascination with demonic possession, prophecy, and 
divine inspiration, voices perceived as coming from the stomach, 
the genitals, or even a crack in the rock have been enthusiastically 
attributed to invisible subjects, rather than to sound-producing 
properties of either inanimate objects or of atypical parts of the 
body itself. Much of the ghoulish fascination that the ventril- 
oquist's dummy attracts lies in the obligatory projection of a 
subject behind the grotesque appearance. Connor writes: 

For I produce my voice in a way that I do not produce these other 
attributes [eyes, hair, gait, fingerprints, etc]. ... giving voice is 
the process which simultaneously produces articulate sound, and 
produces myself, as a self-producing being (Connor, 2000, p. 3). 

It is telling that the words uttered in one of the very earliest sound 
recordings, made by Alexander Graham Bell in 1881, are "T-r-r — 
T-r-r — There are more things in heaven and earth Horatio, than 
are dreamed of in our philosophy — T-r-r — J am a Graphophone 
and my mother was a Phonograph" (Volta Laboratory, 2013, 
Emphasis added), thus instinctively investing one of the very first 
disembodied voices born of technology with subjecthood of its 
own. Remarkably, the telephone and the phonograph came into 



being almost simultaneously — in 1876, 1877. Add to these the 
advent of radio transmission of the human voice, first done in 
1900 in Brazil, and it is clear that we have been awash in disem- 
bodied voices for over a 100 years and counting. The irritating 
proliferation of pseudo-personalities such as the iPhone's Siri 
seems likely to continue. 

If the voice Descartes conjures up alone generates a subjectivity 
that is aligned with the classic subject-object distinction at the 
level of the single individual, then we might give consideration 
to the possibility that voice employed in different circumstances 
might generate other forms of subjectivity, without commitment 
to individual Cartesian minds. 

3.1. SHARED SUBJECTIVITY AND COMMON GROUND 

When an utterance is made in a specific context with speaker and 
listener both present, it is interpreted in the light of the shared 
understanding of all parties. This has found expression in theo- 
retical notions of common ground (Clark and Brennan, 1991), or 
socially shared cognition (Schegloff, 1991). Most developments 
of the idea of common ground are couched within the informa- 
tion processing/message passing framework, and therefore make 
use of some version of aligned or shared representational content. 
However, it is not necessary to appeal to such unobservable con- 
structs from a hidden Cartesian world (Hutto and Myin, 2013). 
There is ample evidence that participants in a conversational 
exchange become mutually linked in many subtle but observ- 
able ways. Eye movements (Richardson et al., 2007), postural 
sway (Shockley et al, 2009), and even blinking (Cummins, 2012) 
have all been found to become subtly intertwined in conversation, 
leading to a dynamic entanglement of the participants. Speakers 
and listeners are further linked through the provision by the latter 
of signals of ongoing engagement through postural, gestural, and 
vocal indices or backchannels (Wagner et al., 2014). 

The yoking together of two or more people engaging in lan- 
guage behavior establishes a common basis from which the par- 
ticipants confront the world. It makes available a shared frame- 
work within which statements can be interpreted. It thus provides 
a scaffold for shared intentionality (Carr, 1987). The ability to 
share an intentional perspecitive seems to be at the very heart 
of human language use, but it is not an all or nothing affair. 
Two protesters with common purpose who chant the same slogan 
demonstrate an extreme alignment with respect to the world. But 
two people engaged in heated disagreement must still achieve a 
great deal of alignment in order to disagree felicitously. The topic 
of disagreement must be foregrounded, at the expense of every- 
thing else. In disputing causal chains, in laying out competing 
sequences of events, and in presenting different interpretations 
of the significance of actions and events, two disputants are nec- 
essarily sharing a great deal of background framing, picking out 
these events rather than those, identifying the same actors, while 
quarreling over their respective roles. Even in the absence of 
conversational exchange, people observing the same scene exert 
reciprocal influence on one another, such that their gaze behav- 
ior, and by inference, the details they pay attention to, become 
inter-dependent. In a series of experiments summarized in Dale 
et al. (2013) gaze behavior of subjects are demonstrated to depend 
sensitively on the presence of others, and on whether one subject 
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knows or believes that the others are seeing and hearing the same 
things as they are. 

If joint languaging provides a very powerful example of inten- 
tional alignment, then it might be that that the ability to coordi- 
nate the manner in which we jointly pay attention to the world is 
an important skill that facilitated the emergence of such behavior, 
as argued in Fusaroli and Tylen (2012). Sometime between the 
last speciation event some 5 or 6 million years ago that gave rise 
to chimpanzees and bonobos on the one hand, and the hominid 
line on the other, something happened that had profound con- 
sequences for our ability to share perspectives and to coordinate 
with one another. There is one small biological change that we 
know occurred in that time, that might play a significant role 
here. That change gave rise to the white sclera of the human eye 
that contrasts vividly with the darker iris, thus providing a very 
clean signal of the direction of gaze of a partner (Tomasello et al., 
2007). The other great apes do not have such a contrast, and their 
ability to align their gaze is severely limited, and based on head 
direction rather than the eyes — although chimpanzees and bono- 
bos in particular do display some evidence of understanding the 
visual perspective of another (Okamoto-Barth et al., 2007). The 
ability to follow each other's gaze thus facilitates the sharing of 
attention, and has been demonstrated to structure mother-child 
interactions, while inducing the abilty to take part in languaging 
(Tomasello and Farrar, 1986). 

As common ground is established, the subjective point from 
which utterances are spoken also shifts. Vygotsky has pointed out 
how the (linguistic) subject becomes an implied, rather than an 
overt, element in speech once common understanding has been 
established (Vygotsky, 1986, p. 236). For example, it would be odd 
to respond to the question "Would you like a cup of tea?" with 
the answer "No, I don't want a cup of tea," instead of simply "No." 
Similarly, a group of people waiting for a bus establishes suffi- 
cient shared context that no one is likely to point out the obvious 
and say "The bus for which we are waiting is coming," but simply 
"coming" or some such expression. The dropping of the linguistic 
subject is more extensive yet in inner speech, of which Vygotsky 
says "it is as much a law of inner speech to omit subjects as it is 
a law of written speech to contain both subjects and predicates" 
(Vygotsky, 1986, p. 243). Many languages allow dropping of any 
explicit mention of the subject once they can be inferred on prag- 
matic grounds. This is not merely a syntactic quirk of one group 
of languages, as it is found in such typologically distant languages 
as Japanese, Chinese, Turkish, and Spanish (Huang, 1984). 

It would be a mistake to simply equate the subject pole of 
a subject-world complementary pair with the syntactic subject, 
but it would be inexcusable too to ignore the deep link between 
the fundamental linguistic structure of subject and predicate on 
the one hand and the subjective pole from which utterances are 
brought forth on the other. The subject pole that arises in the 
unfolding of the voice grounds intentionality, and provides an 
anchoring point for reference. This is, perhaps, most explicit in 
the manner in which deixis functions, allowing use of terms such 
as "there," "here," "then," "now," whose meaning is anchored in 
the joint situation created by conversational participants; It is 
also explicit in the manner in which the first personal pronouns, 
both singular and plural, find flexible, and context-specific use. 



It is implicit too in establishing a shared register and perspective 
within which meaning is negotiated. The differentiation of sub- 
ject and world, and the ability to establish a shared perspective 
within which utterances function, precedes any overt syntactic 
knowledge or awareness by millennia (Olson, 1996). 

3.2. ALIGNMENT vs. SYNERGY 

The dynamic intertwining of conversational participants interact- 
ing in real time has not gone unnoticed. An influential approach 
to account for the many overt and subtle ways in which two inter- 
locutors become linked is found in the Interactive Alignment 
model of Pickering and Garrod (2004, 2014). This model seeks 
to describe the tendency for conversational partners to imi- 
tate one another at a variety of levels, from syntactic biasing, 
through lexical selection, down to the level of phonetic, and 
gestural imitation. The idea that similarity in one domain can 
unconsciously bleed through representational levels to generate 
similarity in other domains provides some explanatory purchase 
on a great deal of corpus-based data. As a general account 
of the dynamic coupling and mutual accommodation found 
among speaker/listeners, however, it is somewhat limited. It leaves 
language resolutely within the heads of individual conversing 
partners, and this does not move beyond the Cartesian, represen- 
tationalist framework. It is "representation-hungry," demanding 
computational representations at many levels, and indeed, in its 
most recent form, it conjures up a baroque series of simula- 
tions inside the heads of individuals who must not only act, but 
also predict the actions of others (Pickering and Garrod, 2014). 
This approach does not generalize in any obvious way to multi- 
party conversations. Nor does it account for coupling among 
interactants that are not strictly imitative in nature, as with the 
mutual influence exerted on blinks (Cummins, 2012). The ten- 
dency to alignment suggests that felicitous conversation would 
result in mere mimicry, which is again not what we observe, and 
it privileges similarity, at the expense of complementarity, thereby 
missing the fundamental role-based nature of conversation in 
which the positions of speaker and listener alternate. 

A competing account has recently been proposed that regards 
inter-personal coordination in dialog as a form of synergy or 
dynamical coupling (Fusaroli et al., 2014). This approach is 
rooted in dynamical approaches to coordination that are level- 
agnostic, seeking to understand emergent phenomena at one level 
(e.g., the dyad) as arising through processes of self-organization 
from the constrained interaction of autonomous components at 
a lower level (the speaker/listeners) (Kelso, 1995; Latash, 2008). 
This approach highlights the sensitivity of participants to real 
time recurrent interaction, as is evident even in the early inter- 
actions of infants and mothers (Murray and Trevarthen, 1986). 
It emphasizes the intertwining of the movements of participants, 
leading to dimensional reduction, so that two interacting per- 
sons become, temporarily, a simpler collective entity than the 
two persons considered as a mere conjunction of individuals. 
It acknowledges both synchronized and complementary actions 
as they contribute to this simplification, and it emphasizes the 
manner in which shared understanding of task constraints leads 
to stability of patterning in time. Although still somewhat spec- 
ulative, this level-independent approach seems commensurate 
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with the approach to be developed here that treats groups of 
people as synergetically organized domains in their own right, 
with respect to which subjectivities of a collective nature can be 
identified. 

Synergistic approaches to human communication have been 
argued for by others. Thibault (20 1 1 ) adopts a position not unlike 
the present one in which a fundamental distinction is drawn 
between what he calls talk and text. The role of voice described 
both here and in his work emphasizes the bodily entrainment 
that arises at a very fine scale among interactants, while the prop- 
erties that linguists conventionally consider, and that admit of 
a computational description, constitute a distinct, and second- 
order set of phenomena. Although not focussed on languaging, 
Riley et al. (2011) argue that interpersonal movement coordina- 
tion is the result of establishing interpersonal synergies of the 
sort described here, and they distinguish between component- 
dominant dynamics, as portrayed within a cognitivist framework, 
with interaction-dominant dynamics in which the autonomy of 
the level of interaction is more thoroughly acknowledged. Finally, 
the perceptual crossing paradigm introduced by Auvray et al. 
(2009) provides a minimalist experimental set up in which two 
people interact in real time in a minimal virtual space. While not 
communicative in any conventional sense, the nature of the emer- 
gent behavior observed serves to illustrate the principal point 
being made that the interaction itself constitutes a level of relative 
autonomy that is not reducible to the conjunction of proper- 
ties of its components (Froese et al., 2014). These latter two 
examples illustrate that social interaction and languaging are not 
separate phenomena. Languaging is a constitutive part of the 
manner in which interpersonal entrainment or coupling arises in 
the moment by moment real time reciprocal interaction among 
people. 

3.3. VOICE vs. WRITING 

Before giving further consideration to the relationship between 
subjecthood and voice, it is appropriate to recall the vast chasm 
that separates speech from writing, not least as the claim is 
made here that most of the phenomena described by modern 
linguistics relate, in fact, to the structure of written communi- 
cation, and are only indirectly relevant to the act of speaking, 
which is the central form in which languaging is manifested 
(Linell, 2005). Since the advent of alphabetic writing in Greek 
society, a naive view has been available that writing is simply 
a device for transcribing speech. Olson (1996, p. 66) identifies 
overt statements that express this view from Aristotle, Saussure, 
Bloomfield, and more. This is why theories of syntax, morphol- 
ogy, and semantics, that together delimit much of that which we 
call "language," allow themselves to study and model the formal 
characteristics of symbol strings, without consideration of the 
medium of expression. This insensitivity to the enormous differ- 
ences between writing and speech underlies the focus by Saussure 
on langue rather than parole, and by Chomsky on competence, 
rather than performance. With that, modern linguistic theory has 
turned its attention away from the most common form of lan- 
guaging, indeed the only one that existed from the fuzzy origins 
of speech until the relatively recent development of writing and 
the even more novel phenomenon of mass textual proliferation. 



It has ignored the real time reciprocal interaction among people 
giving voice from context-specific situations of concern. 

We have now a wealth of research that documents very sub- 
stantial changes that arise with the advent of writing, and espe- 
cially with the spread of literacy consequent to the development 
of printing. These changes affect not only the way language is 
used, but the very structure of the consciousness of language users 
(Stewart, 2010). Ong (1982) provides an authoritative and com- 
prehensive catalog of differences between the way knowledge is 
managed, shared, and verbalized in primary oral cultures, and 
in highly literate ones. Olson (1996) further documents the pro- 
found conceptual and cognitive implications of the spread of 
literacy. Much of this work focusses on the novelties that accom- 
pany writing and literacy. McLuhan claimed that "writing was an 
embalming process that froze language" (McLuhan, 1964), and 
he provides an anectode from Prince Modupe, who speaks of his 
encounter with the written word in his West African days: 

The one crowded space in Father Perry's house was his book- 
shelves. I gradually came to understand that the marks on the 
pages were trapped words. Anyone could learn to decipher the sym- 
bols and turn the trapped words loose again into speech. The ink 
of the print trapped the thoughts; they could no more get away 
than a doomboo could get out of a pit. . . (McLuhan, 1964, p. 84). 

With writing, texts achieve an independence from their sources. A 
spoken utterance is necessarily vouched for by the speaker, while 
a written sentence asserts, without the contingency and commit- 
ment of a speaker. I have mentioned that voice gives rise to a 
subjective pole. Here we can see that the complement is also true: 
writing gives rise to a particular kind of objectivity, one in which 
for the first time it is possible to have "facts that speak for them- 
selves" (Latour, 2013). (For an insightful account of several ways 
in which objectivities are constructed, see Daston and Galison, 
2007). Written sentences remain immutable and thus support dis- 
section and analysis in a way that spoken utterances, which must 
be articulated each time they come into being, do not. The further 
development of speech and language technologies in the service 
of message passing has given rise to forms of spoken langauge, 
e.g., in news broadcasts or public service announcements, that 
bear greater similarity to written texts than to spoken utterances, 
while recent increases in the possibility of text-based reciprocal 
exchanges, e.g., in SMS messaging, further serve to complicate the 
relation between voices, texts, messages, and intentions 3 . 

It is interesting in this regard to consider the constraint 
observed by Everett to hold in the language of the Piraha, an 
Amazonian tribe whose language is remarkable in its simplic- 
ity and omissions, having no counting system, very restricted 
tenses, arguably no syntactic recursion, etc. The Piraha also have 
no mythology or stock of fiction. Everett attributes many of 
these constraints to what he calls the Immediacy of Experience 
Principle, according to which statements by the Piraha "contain 
only assertions related directly to the moment of speech, either 



3 My thanks to the anonymous reviewer who pointed out that the stark 
dichotomy between spoken and written texts has become considerably more 
complex. 
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experienced by the speaker or witnessed by someone alive dur- 
ing the lifetime of the speaker" (Everett, 2009a, p. 132). Here, the 
strong tie betwen the speaker and the words spoken appears to 
have become sedimented into the very structure of the language 
and culture, leaving no room for the disembodied words found in 
writing. It is perhaps no coincidence that Everett's observations 
have become controversial precisely among those linguists who 
hold syntax, and syntactic recursion in particular, to be central to 
the very nature of language (Hauser et al., 2002; Everett, 2009b). 

4. SPEAKING IN UNISON 

The act of speaking in unison is a common form of vocal behavior 
that is accorded no particular theoretical significance in a message 
passing view of language. On the received view, minds and sub- 
jects are closed and singular; thus many people saying the same 
thing at the same time appears merely as a multiplication of the 
individual speaker. The behavior does seem somewhat perplexing 
though, for what message is being passed if we all know the words? 
It is worthwhile to consider both the occasions in which people 
often speak in unison, and the form of the speech so produced. 

"Joint speaking" is an umbrella term I have coined to cover all 
occasions in which the same words are uttered by multiple people 
in unison (Cummins, 2013a). This includes many practices of col- 
lective prayer, the chants of both protest demonstrators and sports 
fans, the recitations of young school children, performances of 
choral speech, and the swearing of collective oaths in secular con- 
texts. To all these naturally occurring variants we can also add the 
simultaneous reading of novel texts by pairs (or more) of speak- 
ers in the laboratory in a paradigm known as Synchronous Speech 
(Cummins, 2003, 2009). 

This brief survey of situations in which people speak in unison 
makes it clear that this behavior is very widespread, and is found 
in virtually every culture. It is thus a central, and not a periph- 
eral, example of languaging. With the exception of joint speaking 
in classrooms, which serves a multitude of purposes imposed 
by educational authorities rather than expressing any sentiment 
of the speakers, all of the naturally occurring forms of joint 
speech are found in situations in which the attribution of col- 
lective, shared, intentionality seems to straightforwardly capture 
the significance of the practice for participants. In prayer con- 
texts, collective speaking testifies to shared beliefs. In protest, the 
shared purposes of the crowd are made manifest through chant- 
ing. Among sports fans, chants are a means by which collective 
identity is sustained and asserted. None of this is at all surpris- 
ing, nor in need of precise definition — at least, no more precise 
than seems warranted for the attribution of beliefs, desires, and 
intentions to individuals. While we may not all be enthusiastic 
chanters, even a reluctance to join in such behavior testifies to 
the obligatory assocation of such voicings with the underlying 
sentiments. 

But if message passing does not illuminate such behavior, it 
seems fair to ask how we might better characterize it; why are 
people engaging in such vocal activity, if not to pass ideas around? 
While there is probably not a single answer to this question, a use- 
ful conceptual approach suggests itself from the theory of speech 
acts (Austin, 1975). Austin noted that many utterances achieve 
something simply by virtue of being spoken. Examples include 



"I pronounce you man and wife," or "I apologize for my behav- 
ior." Such utterances he called "performatives." In the treatment 
provided by Austin, they are frequently signaled by such verbs as 
"pronounce," "decree," "promise," etc. The set of performatives 
Austin alludes to, and the associated set of acts performed is very 
restricted. If there is merit to the idea that uttering gives rise to 
the complementary poles of subject and world, then all utterances 
might properly be considered to be performatives, and the estab- 
lishment of a transient subject pole with an implicit intentional 
structure would then be an achievement of the act of uttering. 
This approach to understanding joint speech helps to make sense 
of some of its most reliable features. In what follows I will consider 
mainly the three most common forms of joint speech 4 : collective 
prayer, protest chanting, and sports chanting. 

All three forms of joint speech are frequently, almost 
inevitably, characterized by repetition: the same phrase or short 
verse is repeated tens, or even hundreds of times over. Repetition 
makes sense if the temporally bound act of utterance is required 
to establish and maintain a transient subject pole with respect 
to which we can identify beliefs or intentions. Repetition is 
undergirded by physical actions such as fist pumping, bead twid- 
dling, or arm waving. While bead manipulation is relatively 
private, the more macroscopic actions further serve to facilitate 
synchronization among participants. 

Repetition also serves to accentuate and exaggerate the rhyth- 
mic properties of utterances, while repetition of a short phrase 
can also induce a change in perception from speech to song 
(Deutsch et al, 2011). In repeated spoken chants, the form of 
speech that arises thus blends seamlessly into the musical domain, 
establishing a continuity between speech and music. The close 
relation between spoken and sung chant is signaled by the very 
ambiguous nature of the word "chant" in English which applies 
with equal facility in either domain. It is interesting that a focus on 
collective speech makes a continuum between speech and music 
appear natural, even obligatory, while the message passing per- 
spective as articulated most clearly by Pinker (1999) insists on 
an absolute divide between the two domains. On the message- 
passing view, speech is an expression of the highly valued notional 
faculty of language, and thus central to our human minds, while 
music is denigrated as "auditory cheesecake," with no — from 
his perspective — apparent functional significance, thus merit- 
ing being grouped together with artistic expression, cheesecake, 
and pornography (Pinker, 1999). If anything illustrates the lim- 
ited capacity to describe, or even see, that the message passing 
perspective induces, surely it is this failure to appreciate the con- 
tinuum we are all familiar with that extends from instrumental 
music, through song, rap, poetry, rhymes, rhetoric, and chant 
(Cummins, 2013a). We might note in passing that the contrast 
between the real time participatory nature of the voice that is 
here contrasted strongly with the frozen nature of writing finds 
a strong parallel in contemporary discussion of the relationship 
between live musical performance and recording (Chanan, 1995). 

We like to speak of the "wisdom of crowds," but the rather 
more familiar notion of the ignorance of the mob, whose powers 



4 1 hypothesize — I am not sure how one might measure relative frequency 
here. 
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of reason are not to be trusted, is perhaps more apt for many of 
the situations under consideration. While groups have frequently 
been found to outperform individuals in tasks of judgment and 
estimation (Koriat, 2012), groups involved in joint speech of 
protest are often found in volatile situations where collective 
actions are rudimentary and aggressive. It is worth noting though 
that some degree of sophistication in the beliefs that are jointly 
articulated is provided by the formal scaffold of call and response. 
The device of having a single leader call a series of questions to 
which the crowd provides a series of responses is found in both 
prayer and protest, though perhaps less so in sports chants. In 
prayer, this sequence of leading call and collective response is 
often formalized into liturgical rites, allowing for a great deal of 
complexity in the beliefs that are thereby expressed. In protest, it 
is far more common to see only a single call, and a single response, 
and the very nature of protest mitigates against the kind of codi- 
fication found in ritual liturgical practices. Sports chanting seems 
to be more concerned with the demonstration of collective iden- 
tity than with the formulation of explicit statements of belief or 
intention, and call-and-response chants are less common. 

If we view writing as an elaboration of some aspects of speak- 
ing, i.e., a technological extrapolation that gives rise to a formal 
system of the kind studied under the somewhat misleading label 
of "language," then we might observe that vocal behavior, or 
languaging, appears to have other extrapolations, other forms 
of extension, and other forms of codification, so that the for- 
mal constructs of the linguists are not the only descendents of 
the voice. Collective speech has found integration into rituals in 
a great diversity of traditions. The Abrahamic religions all for- 
malize collective speaking within their respective services, and in 
each of them the rituals integrate joint speaking into a carefully 
orchestrated sequence of complementary acts by service leaders 
and participants that include highly stylized sequences of move- 
ments such as bowing, kneeling, marching, etc. Other religious 
traditions have engaged in similar forms of codification (Bell, 
1988). Parallels between linguistic grammar and ritual structure 
have previously been noted (Michaels et al., 2010), but the prin- 
cipal point argued here is that voice has given rise to more than 
one species of formalization. Liturgy and ritual do not admit of 
the same generative mutability as freely spoken or written text, 
but by codifying such utterances in collective speech and rit- 
ual, the implicit intentional structure that arises in speaking and 
performing, together with the associated belief structure, is sta- 
bilized. With such observations, the boundaries of "language" 
become somewhat less determinate, and the subjects that find 
voice become both more numerous and more varied. 

4.1. DYNAMIC ENTANGLEMENT IN SYNCHRONOUS SPEAKING 

If the relation between voice and subjectivity put forward here 
has merit, joint speaking appears as an extreme example that can 
serve to hone our considerations of the form and nature of col- 
lective intentionality. In monolog, I alone dictate the intentional 
ground of my utterances; in conversation, the shared ground 
is fluid and negotiated; in chanting it is immovable. Are there 
then any signatures of joint intentionality that we can observe? 
In the spirit of the dynamical coupling hypothesis of Fusaroli 
et al. (2014), we might look for evidence that joint speakers 



are strongly coupled, giving rise to emergent phenomena at the 
supra-individual level. 

In a series of behavioral studies in which speakers are asked 
to read novel texts in unison, no major differences that would 
serve to pick out speech as collective based on its acoustic char- 
acteristics alone have been observed (Cummins, 2014). Speech 
produced in these constrained laboratory settings is remarkably 
unremarkable, and the technique of having subjects speak in 
synchrony has been used as a device for obtaining unmarked 
speech in several phonetic studies (Krivokapic, 2007; Kim and 
Nam, 2008; O'Dell et al, 2010; Dellwo and Friedrichs, 2012). 
The unmarked phonetic structure of speech elicited in the syn- 
chronous speaking situation contrasts strongly with the obser- 
vation that texts recited in ritual and rite are frequently, if not 
inevitably, highly stylized in prosodic form. For example, consider 
the typical pattern with which the Hail Mary is said when recit- 
ing the rosary, or, in a secular context, the characteristic form of 
the Pledge of Allegiance as recited by American schoolchildren. 
Prosodic stylization thus appears as a reliable, but not necessary 
characteristic of joint speech. 

There is one form of speech error found in a synchronous 
speech task that seems to be unique to that situation, and that 
illustrates a strong dynamic coupling between speakers. When 
one speaker makes a speech error, it is frequently, though by 
no means always, observed that both speakers stop speaking 
simultaneously. Sometimes this abrupt cessation can even be in 
mid-syllable. Abrupt and simultaneous cessation of speech seems 
to be unique to this situation, and I have previously compared 
it to the collective tumbling that happens so readily in a three- 
legged race if either participant makes a misstep (Cummins et al., 
2013). This seems to suggest that the task of synchronizing leads 
to a close intertwining of the process of speech production by 
each speaker, leaving each vulnerable to mistakes by the other. 
This observation might be tempered, however, by noting that 
the degree of synchronization found in the laboratory is typically 
much greater than that found in the wild, where relatively loose 
temporal alignment is common and tolerated. 

A second source of empirical phenomena associated with joint 
speaking comes from an fMRI study by jasmin et al. (in prepara- 
tion), in which subjects spoke prepared sentences in a variety of 
conditions, including speaking alone, listening, speaking in syn- 
chrony with the experimenter and speaking in synchrony with 
a recording of the experimenter. Importantly, subjects were not 
informed of the difference between the latter two conditions, and 
on debriefing, they were never aware that recordings were used at 
all. In contrasting the regional blood flow subsequent to speaking 
in the latter two synchronization conditions, a marked difference 
was found in macroscopic patterns of cortical activity, despite 
the obliviousness of subjects to the contrast. In particular, syn- 
chronization with a live person was characterized by an increase 
in activity in right hemisphere locations, including the tempo- 
ral pole, supramarginal gyrus, superior temporal gyrus, and the 
right hemisphere homolog of Broca's area — the latter three are 
areas that, in the left hemisphere, are reliably implicated in speech 
production activity. There is thus a large scale alteration to the 
well-known hemispheric asymmetry that attends speech produc- 
tion, but only when the speaker is coupled in real time to another 
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speaker, and not when the non-self voice has the inflexibility of a 
recording. 

5. VOICE, (INTER-)SUBJECTIVITY AND REAL TIME 
RECURRENT INTERACTION 

As scientists, there is a need to acknowledge that the metaphysical 
background within which one works makes some inquiries pos- 
sible, and some impossible. For all the acknowledged successes of 
the message passing view of language rooted in a Cartesian frame- 
work, there are very many familiar phenomena that have been 
passed over, or, at best, relegated to the outer wastelands of the 
non-cognitive and non-linguistic. I have here sought to work with 
a notion of the subject that is an emergent property of specific 
kinds of interpersonal interaction rooted in real time reciprocal 
exchange. This unconventional view of the subject brings with it a 
very different view of what language is, to the point where the sys- 
tematic formal system described by modern linguistics no longer 
appears to be describing the human capacity to create shared per- 
spective, to generate a shared common ground, and to bring forth 
a common world. Where received approaches to "language" treat 
of regularities found in sequences of symbols, I have focussed on 
the voice, uttered from a specific concerned perspective, and nec- 
essarily tied to the real time negotiation of a subjective pole. In the 
voice, we find a strong index of intentionality, but an intention- 
ally that shifts, that arises fluidly, that is sometimes grounded in 
an individual, sometimes in a negotiated context, and that some- 
times seems to emerge at the collective level in a manner no longer 
reducible to the thoughts, beliefs, and perspectives of the con- 
tributing individuals (Carr, 1987). This dissociation of the voiced 
subject from the solipsistic individual is seen perhaps most clearly 
in the case of joint speech. The emphasis on voice and inten- 
tionality serves to position the symbolic domain of structural 
and generative linguistics as a specific, limited, extrapolation, and 
codification of an older practice of uttering that has given rise to 
several distinct extensions and codifications in such domains as 
ritual and rite. 

The loosening of metaphysical commitments that results when 
we abandon the Cartesian subject offers the opportunity to recon- 
sider many phenomena, and joint speech provides an important 
and familiar case in point. The practice of joint speech is not 
restricted to any particular culture. As well as being ubiquitous, 
it is immediately apparant that the situations in which people 
speak collectively do not form an arbitrary or incoherent set. 
All such situations seem to provide strong evidence of collec- 
tively held beliefs, and it is through the collective voicing that this 
attribution becomes warranted. It might help here to note that 
the subjectivity being treated so rudely is not coextensive with 
the mind of an individual, nor with the idea of a cognitive sys- 
tem, conceived of as a set of sub-personal information processing 
mechanisms that some hypothesize to underlie observed behav- 
ior. The subject pole referred to here is an aggregate to whom it 
makes sense to attribute a limited range of intentions, and in par- 
ticular, beliefs. I am thus wielding the term "belief" here in a sense 
rather like the dispositional account provided by Ryle (1949). This 
flexible notion of the subject seems to work when applied to an 
individual, a conversing dyad, or a lynch mob, each of whom 
can be said to speak from a distinct position, with a specific 



perspective. In strenuously avoiding the Cartesian split between 
mind and world, we would do well to avoid adopting an overly 
rigid metaphysical position. Rather, if subjects admit of the kind 
of treatment proposed here, then an ontological lightness of touch 
that can encompass many kinds of intentional subjects seems 
warranted. 

The empirical phenomena described above strongly highlight 
the importance of real time dynamic interaction among people 
in generating the subject-pole to which beliefs can sensibly 
be attributed. The neural signature of collective speaking is 
found when speaking with a live speaker, but not with a record- 
ing (jasmin et al., in preparation). Live conversational partners 
become entangled not only in ways that fit a linguistic description 
(lexical priming, syntactic biasing, phonological, and phonetic 
imitation, Pickering and Garrod, 2004), but in a host of subtle 
ways that have hitherto been treated of as non-linguistic. These 
include gaze, posture, gestures, and blinks, but this set might 
conceivably be considerably extended as researchers turn their 
attention more and more to physiological markers of interaction 
(Campbell, 2007; Richardson et al, 2007; Shockley et al, 2009; 
Cummins, 2012; Wagner et al, 2014). The voice is an important 
part of the means by which a collective perspective is established 
and maintained, but it is one among many. The interaction of 
voice and gaze may play a particularly strong role in allowing the 
protracted sustainment of conditions of joint attention, which 
appears as a possible foundation for the shared intentionality 
required to ground a human cultural world (Tomasello et al., 
2005 ) 5 . 

The dynamic entanglement seen in conversation, and in 
joint speech, can be empirically described as a form of mutual 
coordination, whereby two or more participants display a tran- 
sient inter-dependence on many levels (Shockley et al., 2009; 
Fusaroli et al., 2014). This third-person account lends itself 
well to ethological and experimental observation and mod- 
eling. A well-worked mathematical framework for describing 
how autonomous systems that interact in real time can give 
rise to emergent phenomena at the collective level is available, 
e.g., as illustrated by the field of coordination dynamics (Kelso, 
1995; Oullier and Kelso, 2009). Social cognitive neuroscience has 
recently begun to recognize that nervous systems of interacting 
individuals behave quite differently from those of solitary sub- 
jects, and often become inter-dependent (Hari and Kujala, 2009; 
Babiloni and Astolfi, 2012; Schilbach et al., 2013). This opens up 
a vast empirical research agenda for the future. 

But the shifting ground of subjectivity that is here espoused 
poses challenges for description from a phenomenological or 
experiential point of view. Here, the recent concept of participa- 
tory sense-making may be of assistance (De Jaegher and Di Paolo, 
2007; Fuchs and De laegher, 2009). Participatory sense-making 
extrapolates from the basic enactive account that grounds sense- 
making (perception/action in the service of the generation of 
meaning) in the adaptive interaction of an autonomous agent 
with its environment (Froese and Di Paolo, 2011). Building on 



5 Small wonder then that the appearance of "language" appears utterly myste- 
rious from the vantage point of modern linguistics (Hauser et al., 2014). The 
discipline has defined its own subject almost out of existence. 
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this perspective, participatory sense-making describes how the 
moment-to-moment interaction of two subjects gives rise to a 
mutuality in their joint sense-making, allowing for the joint 
creation of meaning. On this account, the emergent domain con- 
stituted by the inter-dependent activities of two or more subjects 
warrants treatment as a phenomenological domain in its own 
right (Cummins, 2013b). Intersubjectivity then is the enactment 
of a novel phenomenological domain in the sustained, real time 
coordinated activities of two or more people. There appears to 
be a convergence of the theoretical vocabulary and the demands 
raised by empirical studies that bodes well for further scientific 
work. 

A host of open questions relate to the role of clock time and 
synchronized behavior. In collective speaking, we observe highly 
coordinated action that relies, not on a common external beat or 
timekeeper, but on shared knowledge among interactants. Highly 
synchronized behavior that is scaffolded by an external beat is 
also very common, as in music making, marching, or dancing, 
but this kind of collective entrainment does not seem to bring 
with it an automatic sense of commitment to underlying beliefs 
or intentions. We are all familiar with western school kids danc- 
ing happily to the religiously tinged beats of Bob Marley, without 
worrying about whether they really subscribe to the tenets of 
Rastafarianism. Much work remains to be done in gaining a better 
understanding of how collective coordinated behavior gives rise 
to collective intentionality, and what the necessary preconditions 
for that in the contributing individuals are. 

A willingness to countenance subjective poles that are not co- 
extensive with the individual person, and that rise and fade in 
a dynamic fashion, is incompatible with the grounding assump- 
tions of much of conventional psychology. Of course, psychology 
itself has grappled since its inception with the boundaries of the 
subject (Dewey, 1896). One way of describing the subject mat- 
ter of psychology is with reference to the twin poles of experience 
and behavior, for which a causal account is sought. This approach 
looks out at the world from a subject whose existence, persistence, 
and integrity is taken for granted. The approach taken here, and 
enabled by the enactive framework more generally, is to reverse 
the direction of inquiry, from a view toward experience (whose?) 
and behavior (by whom?), and to look instead at the shifting ref- 
erents of the personal pronouns "I," "we," "you," etc. It is here that 
it becomes apparent that the received view of language will not 
serve, any more than the notion of a solipsistic mind. Of course 
the contemporary scientific view of language is deeply rooted in a 
specific set of psychological commitments, and a view of mind 
as information processing, that together gave birth to the cog- 
nitivist worldview. Adopting a different stance with respect to 
the ground of experience must, it seems, go hand in hand with 
a willingness to question the boundaries that have traditionally 
served to demarcate the linguistic domain. This opens up the 
enticing prospect that we might begin to question, negotiate, and 
re-evaluate just what, and who, "we" think "we" are. 

In Seeger (2004), an account is provided of the way music and 
song are integrated into the lives of the Suya people of the Amazon 
basin. Some songs, the shout songs, are sung from what we might 
consider a conventional egocentric perspective. Others are sung 
in unison. Of these Seeger notes: 



The Suya men said they sang shout songs for their sisters. . . When 
I asked them for whom they sang unison songs, they responded 
that they simply sang them. They weren't for anyone. A man did 
not sing a unison song as a brother, lover, or individual. He sang 
it as a member of a group, whose identity was partly established 
through the song. Thus, they sang for a general audience: the act 
of singing was the statement. In some sense, invocations had no 
audience at all. . . (Seeger, 2004, p. 83) 
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